I would like to answer the following question using a dataset of 1,599 quality-ranked red wines: which chemical properties influence the quality of red wines?
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The dataset contains 13 variables and 1,599 observations. (Variable X is just a row index, so strictly there are 12 variables that describe the wines.)
## X fixed.acidity volatile.acidity
## 0 0 0
## citric.acid residual.sugar chlorides
## 0 0 0
## free.sulfur.dioxide total.sulfur.dioxide density
## 0 0 0
## pH sulphates alcohol
## 0 0 0
## quality
## 0
There are no missing values in this data set.
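A per-column check like the one summarized above can be sketched in base R as follows. The toy data frame and its values are made up for illustration (two NAs are planted on purpose); on the real data the same call would be applied to the full data frame.

```r
# Toy data frame standing in for the wine data (values are hypothetical).
toy <- data.frame(
  pH      = c(3.51, 3.20, NA, 3.16),
  alcohol = c(9.4, 9.8, 9.8, NA),
  quality = c(5L, 5L, 5L, 6L)
)

# Count missing values per column; a vector of zeros means complete data.
na.counts <- colSums(is.na(toy))
print(na.counts)

# TRUE only when no column contains an NA.
no.missing <- all(na.counts == 0)
```

For the wine data every count is 0, so `no.missing` would be `TRUE`.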
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
As the histogram above shows, most wines fall into the quality bins “5” and “6”, and far fewer fall into “3”, “4”, “7”, and “8”. The mean quality is 5.636, and the table shows that the mode (the most frequent level, which is well defined because quality is discrete) is “5”.
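The summary figures above can be reproduced from the frequency table alone. This sketch rebuilds the quality vector from the reported counts; the variable names are only illustrative.

```r
# Rebuild the quality vector from the frequency table reported above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

counts <- table(quality)

# Mode: the level with the highest count.
mode.quality <- as.integer(names(counts)[which.max(counts)])

mean.quality   <- mean(quality)
median.quality <- median(quality)

# Share of wines graded 5 or 6.
share.5.6 <- mean(quality %in% c(5, 6))
```

The share of wines graded 5 or 6 comes out at roughly 82%, which is why the other grades are hard to analyze.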
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed.acidity looks roughly normal with a slight positive skew: the median is 7.90 and the mean is 8.32.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile.acidity is also slightly positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric.acid shows that a large number of wines contain very little citric acid, but there also appears to be another peak just below 0.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The histogram above shows that residual.sugar is strongly positively skewed. Most wines have a residual sugar level between 1 and 3, but a number of wines lie well beyond 10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides looks similar to the residual sugar distribution above. Since there are outliers with very high chloride levels, the second plot zooms in on the low chloride levels, where the distribution looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free.sulfur.dioxide also looks similar to the residual sugar distribution.
The histogram above uses a log10 transformation of free.sulfur.dioxide, because the previous histogram was strongly skewed to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Again, the distribution of total.sulfur.dioxide is positively skewed.
The histogram above also uses a log10 transformation of total.sulfur.dioxide, because the histogram on the original scale was strongly skewed to the right.
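To illustrate why a log10 transformation helps with right-skewed measurements, here is a minimal sketch on simulated log-normal values. The data are NOT the real sulfur dioxide readings, and the `skewness` helper is a hypothetical function written for this example.

```r
set.seed(1)

# Simulated right-skewed values (log-normal); not the real measurements.
x <- rlnorm(1599, meanlog = 3.5, sdlog = 0.7)

# Sample skewness: mean of the cubed standardized values.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew.raw <- skewness(x)         # clearly positive on the original scale
skew.log <- skewness(log10(x))  # close to zero after the transformation
```

A skewness near zero after the transformation is what makes the log10 histogram look roughly symmetric.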
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The distribution of density looks very close to normal, with a mean of about 0.997.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH is also approximately normally distributed, with a few outliers around pH 4.0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is strongly skewed to the right; wines with a sulphates level approaching 2.0 can be considered outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol distribution is skewed to the right: the number of wines decreases as the alcohol level increases.
I therefore took the natural log of alcohol, which makes the shape of the alcohol distribution easier to see.
There are 1,599 sensory-tested red wines in the data set, described by 11 attributes. Most of the attributes are main components that are thought to determine the quality of wine.
Based on a sensory test, each red wine is graded on a quality scale from 1 to 10, where “1” is the worst grade and “10” is the best.
Although most of the attributes are chemical contents used to determine the quality of wine, “pH” is the only attribute on an objective scale, which describes how acidic or basic a wine is from 0 (very acidic) to 14 (very basic).
Other notable points:
1. Most red wines have a quality of 5 or 6.
2. The following attributes look roughly normally distributed once their outliers are set aside: acidity, sugar, chlorides, density, pH, sulphates.
3. The following attributes are skewed to the right: citric acid, sulfur dioxide, alcohol.
4. Several attributes have a considerable number of outliers.
The main interest of this red wine dataset is how the “quality” of a red wine, which is an entirely subjective scale, is determined by the components of the wine.
If we can construct a model that predicts the quality of a red wine from its components, we could use it to set appropriate prices, support quality control, develop better wines, and more.
Judging by what red wine labels in an ordinary liquor store describe (sweetness, acidity, tannin, fruit, body), attributes such as acidity, citric acid, sugar, density, and pH might determine the quality of red wine.
Alcohol could also be an important attribute for quality, because a wine with too much alcohol is closer to a spirit and one with too little is closer to grape juice.
Since “quality” is a discrete scale, I added a new variable named “quality.factor”. It holds the same values as quality but is stored as a factor rather than an integer, which makes bivariate plots easier.
## Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
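A minimal sketch of the factor conversion, using only the first ten quality values shown in the `str()` output near the top of the report. The data frame name `wine.toy` is hypothetical, and the levels are listed explicitly here because the ten-row sample does not contain every grade.

```r
# First ten quality values, copied from the str() output earlier in the report.
wine.toy <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Same values, stored as a factor with the six observed grades as levels.
wine.toy$quality.factor <- factor(wine.toy$quality, levels = 3:8)
str(wine.toy$quality.factor)

# as.integer() on a factor returns level codes (3 = third level, i.e. grade 5),
# so go through as.character() to recover the original grades.
codes  <- as.integer(wine.toy$quality.factor)
grades <- as.integer(as.character(wine.toy$quality.factor))
```

The level codes `3 3 3 4 3 3 3 5 5 3` match the output above; they are positions in the level list, not quality grades.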
I transformed free.sulfur.dioxide and total.sulfur.dioxide to the log10 scale, and alcohol to the natural log scale, since these distributions are strongly skewed to the right.
I also dropped the variable X, because it is just the row index of each wine and is not relevant to my analysis.
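These preparation steps could look like the sketch below. The four rows are the first values from the `str()` output; the names of the new columns (`log10.free.sulfur.dioxide` and so on) are assumptions made for this example, not names taken from the original analysis.

```r
# First four rows of the relevant columns, copied from the str() output.
wine.toy <- data.frame(
  X                    = 1:4,
  free.sulfur.dioxide  = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60),
  alcohol              = c(9.4, 9.8, 9.8, 9.8)
)

# Drop the row index; it carries no information about the wine.
wine.toy$X <- NULL

# log10 for the two sulfur dioxide measures, natural log for alcohol.
wine.toy$log10.free.sulfur.dioxide  <- log10(wine.toy$free.sulfur.dioxide)
wine.toy$log10.total.sulfur.dioxide <- log10(wine.toy$total.sulfur.dioxide)
wine.toy$log.alcohol                <- log(wine.toy$alcohol)
```

Keeping the original columns next to the transformed ones makes it easy to plot both scales side by side.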
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Looking at the correlation matrix above, related attributes have relatively high correlations with each other: fixed.acidity with citric.acid and pH, and the two sulfur dioxide measures.
Looking at the correlations with quality, alcohol has the highest positive correlation (0.476).
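To make the quality column of the matrix easier to read, the correlations can be ranked. The values below are copied from the matrix above; on the full data they would presumably come from something like `cor(wine)[, "quality"]`, which is an assumption about how the matrix was produced.

```r
# Correlation of each attribute with quality, copied from the matrix above.
quality.cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank from most positive to most negative.
ranked <- sort(quality.cor, decreasing = TRUE)
strongest.positive <- names(ranked)[1]
strongest.negative <- names(ranked)[length(ranked)]

# Rank by absolute size, ignoring direction.
by.size <- quality.cor[order(abs(quality.cor), decreasing = TRUE)]
print(round(by.size, 3))
```

By absolute size the top three are alcohol, volatile.acidity, and sulphates, while residual.sugar is essentially uncorrelated with quality.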
The plot above shows boxplots of the acidity-related attributes by quality (as a factor).
Only volatile.acidity seems to have a clear relationship with quality: the lower the volatile.acidity, the higher the quality.
The boxplot suggests that citric.acid has a positive relationship with quality, but the overlaid points suggest that the many wines with a citric.acid value of 0 (or close to 0) distort the boxplot.
The relationship between residual.sugar and quality is hard to see in the boxplot because there are too many outliers. I therefore omitted the upper-tail outliers of residual.sugar using the Q3 + 1.5 IQR rule.
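As a worked example of the Q3 + 1.5 IQR rule, the sketch below computes the upper fence from the quartiles reported in the summary at the top of the report, for residual.sugar and for the two attributes that receive the same treatment later (chlorides and sulphates). Both helper functions are hypothetical names for this example, and whether the original analysis used `<` or `<=` at the fence is not known.

```r
# Upper fence of the 1.5 IQR rule from the first and third quartiles.
upper.fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)

# Quartiles taken from the summary() output earlier in the report.
fences <- c(
  residual.sugar = upper.fence(1.9, 2.6),    # 2.6  + 1.5 * 0.7
  chlorides      = upper.fence(0.07, 0.09),  # 0.09 + 1.5 * 0.02
  sulphates      = upper.fence(0.55, 0.73)   # 0.73 + 1.5 * 0.18
)
print(fences)

# One-tailed filter: keep only values at or below the upper fence.
drop.upper.outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  x[x <= q[2] + k * (q[2] - q[1])]
}

# Small illustration; the last two values are far above the rest.
kept <- drop.upper.outliers(c(1.9, 2.6, 2.3, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 15.5))
```

The fences come out at 3.65, 0.12, and 1.00, which is consistent with the maxima of 3.6, 0.119, and 0.99 in the wine.refine summary further down.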
After excluding the residual.sugar outliers the boxplot is easier to read, but residual.sugar still does not appear to have a clear relationship with the quality of red wine.
The boxplot of chlorides also has many outliers that obscure the analysis, so I omitted them using the same 1.5 IQR rule.
The plot above suggests a weak negative relationship between chlorides and wine quality.
The plots above are boxplots of the two sulfur dioxide attributes. It is difficult to see a relationship in these plots, and because these distributions are strongly right-skewed, I applied a log10 transformation to each attribute.
From the plots above, both free.sulfur.dioxide and total.sulfur.dioxide appear to have a weak negative relationship with red wine quality.
Boxplots of density, pH, sulphates, and alcohol:
From the graphs above, density appears to have a negative relationship with quality, pH has no relationship, and alcohol has a relatively strong relationship with quality.
Because sulphates has so many outliers that the plot is hard to read, I omitted them using the 1.5 IQR rule.
With the outliers omitted, sulphates appears to have a relatively strong relationship with red wine quality.
Based on the correlation matrix above, I drew the following plots to examine the relationships between attributes. I selected attributes whose correlation with at least one other attribute is above 0.5 in absolute value.
Although the plot shows a positive relationship between alcohol and quality, the points are highly concentrated in part of the alcohol range. Since I had also log-transformed the alcohol distribution, I plotted log(alcohol) against red wine quality as well.
The strong positive relationship between alcohol and red wine quality is still visible.
The plots suggest that fixed.acidity has a positive relationship with citric.acid and density, and a negative relationship with pH.
As expected, free.sulfur.dioxide and total.sulfur.dioxide are positively correlated with each other.
From the box-and-point plots above, the following attributes show some relationship with wine quality, listed by direction.
Positive relationship with red wine quality: citric.acid, sulphates, alcohol.
Negative relationship with red wine quality: volatile.acidity, chlorides, total.sulfur.dioxide.
fixed.acidity appears to have a positive relationship with citric.acid and density, and a negative relationship with pH.
As expected, free.sulfur.dioxide and total.sulfur.dioxide are positively correlated with each other.
From the plot and the correlation matrix, alcohol has the strongest positive relationship with red wine quality.
One thing to note, however, is that the positive relationship between alcohol and quality is only visible where wine quality is 5 or above.
In other words, I cannot see any clear positive relationship between alcohol and quality where the quality is “3” or “4”. (But the samples there are very small.)
I plotted log(alcohol) on the x-axis and citric.acid on the y-axis, both of which are positively related to quality, and colored the points by wine quality.
It is easy to see that wines with high alcohol and high citric.acid tend to have high quality, and wines with low alcohol and low citric.acid tend to have low quality.
The graph also shows that even when the citric.acid level is low, good-quality wines still exist if the alcohol level is high.
The scatter plot above shows clearly that high levels of both alcohol and sulphates indicate high-quality red wine.
Likewise, citric.acid and sulphates are both positively related to quality.
As the plot above shows, chlorides and volatile.acidity, the two attributes negatively related to wine quality, are both still negatively related to quality when plotted together.
From the graph, however, volatile.acidity looks more strongly negatively correlated with red wine quality.
total.sulfur.dioxide might have a negative impact on red wine quality, but the last plot suggests that this might not be the case.
Finally, I plotted the attributes with the strongest negative and the strongest positive relationship to wine quality:
It is clear that a high level of alcohol combined with a low level of volatile.acidity is associated with high-quality wine.
One thing to notice in the plot above is that quality “4” wines appear at every level of these attributes, so quality 4 seems to have no relationship with them, although the sample of quality 4 wines is small.
Extracting only wines of quality “4” and lower:
Now there appears to be a very weak trend: as the alcohol level gets higher, quality 4 wines appear less often. However, even as volatile.acidity decreases, quality 4 wines do not become less frequent; they appear at every level of volatile acidity. This is interesting.
Finally, I grouped the dataset by wine quality and show the conditional median of the main attributes for each quality level.
## Source: local data frame [6 x 5]
##
## quality.factor median.citric.acid median.sulphates median.alcohol n
## (fctr) (dbl) (dbl) (dbl) (int)
## 1 3 0.035 0.545 9.925 10
## 2 4 0.090 0.560 10.000 53
## 3 5 0.230 0.580 9.700 681
## 4 6 0.260 0.640 10.500 638
## 5 7 0.400 0.740 11.500 199
## 6 8 0.420 0.740 12.150 18
## Source: local data frame [6 x 5]
##
## quality.factor median.vola.acidity median.chlorides
## (fctr) (dbl) (dbl)
## 1 3 0.845 0.0905
## 2 4 0.670 0.0800
## 3 5 0.580 0.0810
## 4 6 0.490 0.0780
## 5 7 0.370 0.0730
## 6 8 0.370 0.0705
## Variables not shown: median.tot.sulfur.dio (dbl), n (int)
The tables above confirm the earlier plot analysis and the direction of each attribute's effect on quality.
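A grouped summary of this kind can be sketched in base R with `aggregate()`. The output format above suggests the original used a data-frame package instead, so treat this as an equivalent illustration only; the toy values are made up and `wine.toy` is a hypothetical name.

```r
# Toy data: two wines per quality level (values are hypothetical).
wine.toy <- data.frame(
  quality.factor = factor(c(5, 5, 6, 6, 7, 7)),
  alcohol        = c(9.4, 9.8, 10.5, 10.9, 11.5, 12.1),
  sulphates      = c(0.56, 0.60, 0.62, 0.66, 0.74, 0.76)
)

# Median of each attribute within each quality level.
medians <- aggregate(cbind(alcohol, sulphates) ~ quality.factor,
                     data = wine.toy, FUN = median)

# Group sizes, in the same order as the factor levels.
medians$n <- as.vector(table(wine.toy$quality.factor))
print(medians)
```

Reporting `n` next to each median matters here, because grades 3 and 8 rest on only 10 and 18 wines.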
The only variable I am unsure about is total.sulfur.dioxide. It shows a negative relationship with red wine quality, but low levels of total sulfur dioxide also seem to be associated with low quality scores.
In the plot above, I added a fitted line from a non-linear lm model (y = poly(x,2)) to the scatter plot. The line is concave, so it may be that both high and low levels of total.sulfur.dioxide, in combination with the other attributes, help determine whether a wine is good or bad.
First, I built linear models (including log transformations of some attributes) to see the relationship between each attribute and wine quality.
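A quadratic fit of this kind can be sketched on simulated data as follows. The data are generated to be concave on purpose and are NOT the wine measurements; `raw = TRUE` is my choice so that the third coefficient is the plain x-squared term, whereas the default orthogonal polynomials give the same fitted curve with less readable coefficients.

```r
set.seed(42)

# Simulated concave relationship with a peak at x = 3 (not the wine data).
x <- runif(200, min = 1, max = 5)
y <- 6 - 0.5 * (x - 3)^2 + rnorm(200, sd = 0.3)

# Second-degree polynomial regression.
fit <- lm(y ~ poly(x, 2, raw = TRUE))

# A negative quadratic coefficient means the fitted curve is concave.
quad.coef <- unname(coef(fit)[3])

# Evaluate the fitted curve on a grid and locate its peak.
grid <- data.frame(x = seq(1, 5, length.out = 50))
grid$fitted <- predict(fit, newdata = grid)
peak.x <- grid$x[which.max(grid$fitted)]
```

A concave curve with an interior peak is the pattern described above: values on both sides of the peak are associated with lower fitted quality.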
##
## Calls:
## model1: lm(formula = f1, data = wine)
## model2: lm(formula = f2, data = wine)
## model3: lm(formula = f3, data = wine)
##
## ================================================================
## model1 model2 model3
## ----------------------------------------------------------------
## (Intercept) 21.965 15.238 0.351
## (21.195) (21.438) (0.548)
## fixed.acidity 0.025 0.032
## (0.026) (0.026)
## volatile.acidity -1.084*** -1.122*** -1.034***
## (0.121) (0.120) (0.101)
## citric.acid -0.183 -0.233
## (0.147) (0.146)
## residual.sugar 0.016 0.012
## (0.015) (0.015)
## chlorides -1.874*** -1.716*** -1.917***
## (0.419) (0.418) (0.399)
## free.sulfur.dioxide 0.004*
## (0.002)
## total.sulfur.dioxide -0.003***
## (0.001)
## density -17.881 -15.288
## (21.633) (21.601)
## pH -0.414* -0.360 -0.455***
## (0.192) (0.190) (0.118)
## sulphates 0.916*** 0.906*** 0.883***
## (0.114) (0.115) (0.111)
## alcohol 0.276***
## (0.026)
## log(free.sulfur.dioxide) 0.106** 0.119**
## (0.040) (0.039)
## log(total.sulfur.dioxide) -0.154*** -0.171***
## (0.041) (0.039)
## log(alcohol) 3.004*** 3.095***
## (0.283) (0.185)
## ----------------------------------------------------------------
## R-squared 0.4 0.4 0.4
## adj. R-squared 0.4 0.4 0.4
## sigma 0.6 0.6 0.6
## F 81.3 80.5 126.1
## p 0.0 0.0 0.0
## Log-likelihood -1569.1 -1572.2 -1573.9
## Deviance 666.4 669.0 670.4
## AIC 3164.3 3170.4 3165.8
## BIC 3234.2 3240.3 3214.2
## N 1599 1599 1599
## ================================================================
I created three models. Model 1 is a linear model of all attributes. Model 2 replaces the skewed attributes with their log transformations. Model 3 uses only the statistically significant attributes.
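The three specifications can be written out as formulas. The terms below are read off the coefficient table above; the data frame `sim` is simulated (uniform draws within the ranges from the summary output) purely so the sketch runs, so its fitted coefficients mean nothing.

```r
set.seed(7)
n <- 300

# Simulated stand-in for the wine data; NOT the real measurements.
sim <- data.frame(
  fixed.acidity        = runif(n, 4.6, 15.9),
  volatile.acidity     = runif(n, 0.12, 1.58),
  citric.acid          = runif(n, 0, 1),
  residual.sugar       = runif(n, 0.9, 15.5),
  chlorides            = runif(n, 0.012, 0.611),
  free.sulfur.dioxide  = runif(n, 1, 72),
  total.sulfur.dioxide = runif(n, 6, 289),
  density              = runif(n, 0.9901, 1.0037),
  pH                   = runif(n, 2.74, 4.01),
  sulphates            = runif(n, 0.33, 2),
  alcohol              = runif(n, 8.4, 14.9)
)
sim$quality <- 2 + 0.3 * sim$alcohol - sim$volatile.acidity + rnorm(n, sd = 0.6)

# Model 1: every attribute on its original scale.
f1 <- quality ~ fixed.acidity + volatile.acidity + citric.acid +
  residual.sugar + chlorides + free.sulfur.dioxide +
  total.sulfur.dioxide + density + pH + sulphates + alcohol

# Model 2: the skewed attributes enter in logs.
f2 <- quality ~ fixed.acidity + volatile.acidity + citric.acid +
  residual.sugar + chlorides + density + pH + sulphates +
  log(free.sulfur.dioxide) + log(total.sulfur.dioxide) + log(alcohol)

# Model 3: only the terms that were statistically significant.
f3 <- quality ~ volatile.acidity + chlorides + pH + sulphates +
  log(free.sulfur.dioxide) + log(total.sulfur.dioxide) + log(alcohol)

models <- list(model1 = lm(f1, data = sim),
               model2 = lm(f2, data = sim),
               model3 = lm(f3, data = sim))

# Number of estimated coefficients (including the intercept) per model.
n.terms <- sapply(models, function(m) length(coef(m)))
```

Model 3 estimates 8 coefficients instead of 12, which is why BIC, with its heavier penalty on model size, tends to favor it.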
Finally, I also fitted the same three models to a new data set, wine.refine, in which the outliers of “residual.sugar”, “chlorides”, and “sulphates” have been removed by the 1.5 IQR rule.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :0.900
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.0850 1st Qu.:1.900
## Median : 7.900 Median :0.5200 Median :0.2400 Median :2.100
## Mean : 8.228 Mean :0.5258 Mean :0.2544 Mean :2.184
## 3rd Qu.: 9.100 3rd Qu.:0.6350 3rd Qu.:0.4000 3rd Qu.:2.500
## Max. :15.000 Max. :1.3300 Max. :0.7500 Max. :3.600
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.0120 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.0690 1st Qu.: 8.00 1st Qu.: 23.00
## Median :0.0780 Median :14.00 Median : 37.00
## Mean :0.0779 Mean :15.69 Mean : 44.93
## 3rd Qu.:0.0865 3rd Qu.:21.00 3rd Qu.: 60.00
## Max. :0.1190 Max. :57.00 Max. :165.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.860 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9955 1st Qu.:3.220 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9966 Median :3.320 Median :0.6100 Median :10.10
## Mean :0.9966 Mean :3.324 Mean :0.6329 Mean :10.42
## 3rd Qu.:0.9976 3rd Qu.:3.410 3rd Qu.:0.7050 3rd Qu.:11.00
## Max. :1.0014 Max. :4.010 Max. :0.9900 Max. :14.00
## quality quality.factor
## Min. :3.000 3: 4
## 1st Qu.:5.000 4: 39
## Median :6.000 5:571
## Mean :5.643 6:546
## 3rd Qu.:6.000 7:156
## Max. :8.000 8: 15
## 'data.frame': 1331 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 15 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 65 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
wine.refine is the new dataset that excludes the outliers of these three attributes from the original data set. The sample size decreases to 1,331 observations.
I then constructed the same models as above.
##
## Calls:
## model4: lm(formula = f4, data = wine.refine)
## model5: lm(formula = f5, data = wine.refine)
## model6: lm(formula = f6, data = wine.refine)
##
## ================================================================
## model4 model5 model6
## ----------------------------------------------------------------
## (Intercept) 62.362* 57.369* 0.449
## (25.985) (26.148) (0.596)
## fixed.acidity 0.053 0.056*
## (0.028) (0.028)
## volatile.acidity -0.871*** -0.889*** -0.790***
## (0.129) (0.128) (0.106)
## citric.acid -0.261 -0.285
## (0.156) (0.154)
## residual.sugar 0.044 0.040
## (0.049) (0.049)
## chlorides -1.323 -1.468 -2.535*
## (1.301) (1.303) (1.212)
## free.sulfur.dioxide 0.004
## (0.002)
## total.sulfur.dioxide -0.003**
## (0.001)
## density -58.835* -57.153*
## (26.488) (26.290)
## pH -0.457* -0.440* -0.615***
## (0.204) (0.204) (0.122)
## sulphates 1.821*** 1.828*** 1.702***
## (0.164) (0.164) (0.153)
## alcohol 0.234***
## (0.032)
## log(free.sulfur.dioxide) 0.108* 0.131**
## (0.043) (0.042)
## log(total.sulfur.dioxide) -0.133** -0.158***
## (0.043) (0.041)
## log(alcohol) 2.498*** 2.996***
## (0.337) (0.199)
## ----------------------------------------------------------------
## R-squared 0.4 0.4 0.4
## adj. R-squared 0.4 0.4 0.4
## sigma 0.6 0.6 0.6
## F 80.9 80.5 124.6
## p 0.0 0.0 0.0
## Log-likelihood -1214.6 -1216.0 -1220.6
## Deviance 483.4 484.4 487.8
## AIC 2455.1 2457.9 2459.3
## BIC 2522.6 2525.5 2506.0
## N 1331 1331 1331
## ================================================================
Removing the outliers does not change the adjusted R-squared. BIC is smallest for model 6, so I chose model 6 as the best predictive model.
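The selection step can be made explicit with the information criteria from the table above. This sketch only uses the reported AIC and BIC values; the data frame name `fit.stats` is illustrative.

```r
# AIC and BIC for the three wine.refine models, copied from the table above.
fit.stats <- data.frame(
  model = paste0("model", 4:6),
  AIC   = c(2455.1, 2457.9, 2459.3),
  BIC   = c(2522.6, 2525.5, 2506.0),
  stringsAsFactors = FALSE
)

# Lower is better for both criteria.
best.bic <- fit.stats$model[which.min(fit.stats$BIC)]
best.aic <- fit.stats$model[which.min(fit.stats$AIC)]
```

BIC points to model 6 while AIC points to model 4, because BIC penalizes the extra coefficients more heavily. These values should not be compared with those of models 1 to 3, which were fitted to a different sample (1,599 rather than 1,331 wines).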
I also plotted the true wine quality against the quality predicted by model 6.
The model seems to predict quality reasonably well, but there are some mispredictions.
In the second plot, I converted the predicted quality to a discrete integer grade by rounding to 0 decimal places.
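As a worked example of the prediction and rounding step, the sketch below applies the model 6 coefficients from the table above to a hypothetical wine whose attributes equal the wine.refine medians. The coefficients are the rounded values printed in the table, so the result is approximate.

```r
# Model 6 coefficients, copied (rounded) from the regression table above.
coef6 <- c(
  intercept                = 0.449,
  volatile.acidity         = -0.790,
  chlorides                = -2.535,
  pH                       = -0.615,
  sulphates                = 1.702,
  log.free.sulfur.dioxide  = 0.131,
  log.total.sulfur.dioxide = -0.158,
  log.alcohol              = 2.996
)

# Hypothetical wine: every attribute set to its wine.refine median.
# log() is the natural log, matching the log() terms in the model.
x <- c(1, 0.52, 0.078, 3.32, 0.61, log(14), log(37), log(10.1))

# Linear predictor, then the rounded grade used in the second plot.
predicted       <- sum(coef6 * x)
predicted.grade <- round(predicted)
```

The prediction is about 5.54, which rounds to grade 6. Because fitted values cluster near the middle of the scale, rounding rarely produces the extreme grades.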
It is interesting that, of all the attributes positively related to wine quality (citric.acid, sulphates, and alcohol), alcohol seems to have the strongest positive relationship with red wine quality. Also, even when the other positive attributes are low, a wine tends to be graded relatively high if its alcohol level is high enough.
total.sulfur.dioxide, one of the attributes negatively related to red wine quality, seems to have a non-linear relationship with quality. In general, quality increases as total.sulfur.dioxide decreases, but at very low levels of total.sulfur.dioxide I see both good wines (quality above 6) and poor wines (quality 4).
I created six models: three with the full sample and three with the outlier-excluded sample. Each set of three consists of a simple linear model, a log-transformed model, and a log-transformed model restricted to the statistically significant attributes. As I suspected from the plots, the t values of volatile.acidity, chlorides, sulphates, alcohol, free.sulfur.dioxide, and total.sulfur.dioxide were significant, and their signs (positive or negative) are as expected. The final plot compares actual wine quality with predicted quality (rounded to the nearest integer).
The R-squared was quite low at about 40%, and my predictions only fall in the quality range 5 to 7, but in general the predicted and actual quality correspond to each other.
The wine quality scale runs from 1 to 10, but most wines have a quality of 5 or 6. The mode is 5, the median is 6, and the mean is 5.636.
The plot above describes the relationship between wine quality and the attributes most strongly related to it.
In the left plot, the x-axis is wine quality and the y-axis is the log of the alcohol content (%). The box plots show the mean and percentiles of the log alcohol content. As the alcohol level increases, wine quality tends to increase. For wines of quality 3 to 4 the alcohol level looks high, but this may be misleading because of the small sample size.
The right plot has the same x-axis, but the y-axis is volatile acidity, the amount of acetic acid in the wine, also known as the “vinegar taste”. It is clear that as the level of “vinegar taste” decreases, wine quality increases.
From the plot above, which shows the two attributes most strongly related to wine quality, it appears that wines with more alcohol and less vinegar taste tend to have higher quality.
The last plot describes the relationship between alcohol level, vinegar taste, and wine quality. The x-axis is the alcohol level, the y-axis is the volatile acidity (vinegar taste), and the points are wines colored by quality.
As found in plot two, alcohol level has a positive effect on wine quality and vinegar taste (volatile acidity) has a negative effect.
It is interesting that the combination of the two attributes matters for wine quality.
For example, even when the alcohol level is very high, around 2.4 to 2.5 on the log scale, quality 3 or 4 wines still tend to appear if the level of vinegar taste is high.
Conversely, even when the vinegar taste is low, wines with a low alcohol level tend to score only 4 to 6.
Using a data set of 1,599 red wines with 11 attributes, whose quality was graded by humans from 1 to 10, I first saw that most wines fall into quality grades “5” and “6”.
I then investigated the relationship between each attribute and red wine quality, and found that some attributes seem to have a positive or negative effect on quality while others do not.
The alcohol level seems to have a strong effect on quality: the higher the alcohol level, the better the red wine. I also found that volatile.acidity, also known as vinegar taste, has a relatively strong negative effect on quality.
Finally, I developed a simple linear regression model to predict wine quality. The model seems to fit, but it explains only roughly 40% of the variance. I also plotted actual and predicted wine quality together. Although actual quality varies from 3 to 8, the model only predicts qualities 4 to 6.
Even though my model has limited accuracy, I can use it to predict, to some extent, whether a wine's quality will be better or worse than that of others.
I was also able to plot the relationships between wine quality and some of the attributes and to see clear patterns.
However, because of the very uneven distribution of quality (most red wines are graded 5 or 6), it was very difficult to find more precise patterns or to build a more precise model.
Our dependent variable, “quality”, is a rather vague measurement, and I found it very difficult to compare it with, or model it from, objective numeric variables such as the chemical attributes of the wines.
Although I built the model with simple linear regression, a subjective measurement like “quality” might not be a simple linear combination of the wine's attributes. We might need to construct a more complex model, such as a non-linear or classification model.
Also, during my plotting and modeling I omitted some outliers using the 1.5 IQR rule to make the plots easier to read. If a new wine has attribute levels in that outlier range, my analysis or model may not predict its quality well. Such wines may also have scarcity value, and their quality might be very high or not.
Finally, if I were to extend the analysis of this red wine quality dataset, I would add two variables. The first is the price of the wine: as mentioned above, quality is subjective and concentrated in grades 5 and 6, so replacing it with or adding price could reveal more detail (my guess is that higher-quality wine is more expensive). The second is the age of the wine: I think the taste of a wine can change with age, and by examining its relationship with quality and the other attributes I could build a more precise model to predict quality or price.